System, method, and computer program product for segmenting a database based, at least in part, on a prevalence associated with known objects included in the database

ABSTRACT

A system, method, and computer program product are provided for segmenting a database based, at least in part, on a prevalence associated with known objects included in the database. In operation, a database including a plurality of known objects is identified. Additionally, the database is segmented into a plurality of segments. Furthermore, each of the plurality of known objects is assigned to one of the plurality of segments, based at least in part on a prevalence associated with each of the plurality of known objects.

FIELD OF THE INVENTION

The present invention relates to networks, and more particularly to increasing the efficiency of networks amid the growing use of cloud-based technologies.

BACKGROUND

In the context of network security, the threat landscape has grown exponentially over the last few years. The threat landscape has grown so much that most Anti-Virus vendors are evaluating and implementing various technologies to mitigate the unmatched growth in the number of threats. As the threat landscape grows, so does the need to mitigate the threats associated with that growth.

Currently, the number of network based lookups required in a network cloud is very high. These network based lookups include performing signature lookups across a network for each file scanned on a system (e.g. a client computer, etc.). Thus, as the number of threats increases, the number of lookups required to ensure the network is secure also increases.

In some cases, however, it may be desirable to keep the lookup rate to less than a certain number of lookups per day. For example, it may be desirable to keep the lookup rate to less than ten lookups per day per client. Thus, harsh criteria are often used to keep the lookup rates low. As a result, many problematic items (e.g. malware, infected files, etc.) are not examined and such items are missed on the client systems. There is thus a need for overcoming these and/or other issues associated with the prior art.

SUMMARY

A system, method, and computer program product are provided for segmenting a database based, at least in part, on a prevalence associated with known objects included in the database. In operation, a database including a plurality of known objects is identified. Additionally, the database is segmented into a plurality of segments. Furthermore, each of the plurality of known objects is assigned to one of the plurality of segments, based at least in part on a prevalence associated with each of the plurality of known objects.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 illustrates a network architecture, in accordance with one embodiment.

FIG. 2 shows a representative hardware environment that may be associated with the servers and/or clients of FIG. 1, in accordance with one embodiment.

FIG. 3 shows a method for segmenting a database based, at least in part, on a prevalence associated with known objects included in the database, in accordance with one embodiment.

FIG. 4 shows a method for reducing a number of signature lookups required by a system, in accordance with one embodiment.

FIG. 5 shows a system for reducing a number of signature lookups required by a system and for segmenting a database based, at least in part, on a prevalence associated with known objects included in the database, in accordance with one embodiment.

DETAILED DESCRIPTION

FIG. 1 illustrates a network architecture 100, in accordance with one embodiment. As shown, a plurality of networks 102 is provided. In the context of the present network architecture 100, the networks 102 may each take any form including, but not limited to a local area network (LAN), a wireless network, a wide area network (WAN) such as the Internet, peer-to-peer network, etc.

Coupled to the networks 102 are servers 104 which are capable of communicating over the networks 102. Also coupled to the networks 102 and the servers 104 is a plurality of clients 106. Such servers 104 and/or clients 106 may each include a desktop computer, lap-top computer, hand-held computer, mobile phone, personal digital assistant (PDA), peripheral (e.g. printer, etc.), any component of a computer, and/or any other type of logic. In order to facilitate communication among the networks 102, at least one gateway 108 is optionally coupled therebetween.

FIG. 2 shows a representative hardware environment that may be associated with the servers 104 and/or clients 106 of FIG. 1, in accordance with one embodiment. Such figure illustrates a typical hardware configuration of a workstation in accordance with one embodiment having a central processing unit 210, such as a microprocessor, and a number of other units interconnected via a system bus 212.

The workstation shown in FIG. 2 includes a Random Access Memory (RAM) 214, Read Only Memory (ROM) 216, an I/O adapter 218 for connecting peripheral devices such as disk storage units 220 to the bus 212, a user interface adapter 222 for connecting a keyboard 224, a mouse 226, a speaker 228, a microphone 232, and/or other user interface devices such as a touch screen (not shown) to the bus 212, communication adapter 234 for connecting the workstation to a communication network 235 (e.g., a data processing network) and a display adapter 236 for connecting the bus 212 to a display device 238.

The workstation may have resident thereon any desired operating system. It will be appreciated that an embodiment may also be implemented on platforms and operating systems other than those mentioned. One embodiment may be written using JAVA, C, and/or C++ language, or other programming languages, along with an object oriented programming methodology. Object oriented programming (OOP) has become increasingly used to develop complex applications.

Of course, the various embodiments set forth herein may be implemented utilizing hardware, software, or any desired combination thereof. For that matter, any type of logic may be utilized which is capable of implementing the various functionality set forth herein.

FIG. 3 shows a method 300 for segmenting a database based, at least in part, on a prevalence associated with known objects included in the database, in accordance with one embodiment. As an option, the method 300 may be implemented in the context of the architecture and environment of FIGS. 1 and/or 2. Of course, however, the method 300 may be carried out in any desired environment.

As shown, a database including a plurality of known objects is identified. See operation 302. The objects may include any item capable of being stored in the database. For example, in various embodiments, the known objects may include files or programs.

In either case, the known objects may include objects that are known to be non-malicious. In this case, the database may include a whitelist database. In the context of the present description, a whitelist refers to any data structure that identifies one or more objects that are known to be non-malicious objects.

As another option, the database may include a blacklist database. In the context of the present description, a blacklist refers to any data structure that identifies one or more objects that are known to be malicious, unsafe, or undesirable objects. In this case, the known objects may include objects that are known to be malicious.

In one embodiment, the objects may include whitelisted objects. As an option, the whitelisted objects may be defined utilizing a Bloom filter. In this case, the Bloom filter may be utilized as a whitelist to offset a high false positive rate.

As shown further in FIG. 3, the database is segmented into a plurality of segments. See operation 304. The database may be segmented into any number of segments.

Furthermore, each of the plurality of known objects is assigned to one of the plurality of segments, based at least in part on a prevalence associated with each of the plurality of known objects. See operation 306. The prevalence may be indicative of an amount the object is accessed and/or utilized.

For example, the prevalence may include a high prevalence or a low prevalence. In this case, a high prevalence may indicate that an object is accessed and/or utilized regularly, or more than a predetermined amount. A low prevalence may indicate that an object is not accessed and/or utilized frequently, or less than a predetermined amount. In one embodiment, the prevalence information may be obtained utilizing client system based antivirus software.

The segments may then be allocated such that at least one of the segments corresponds to low prevalence objects. Additionally, at least one of the segments may correspond to high prevalence objects.
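As an illustration only, operations 302-306 might be sketched in Python as follows. The threshold value, the use of an observation count as the prevalence measure, the two segment names, and the choice to place an object exactly at the threshold in the high prevalence segment are all assumptions made for this sketch and are not specified above.

```python
# Hypothetical "predetermined amount": observations at or above this value
# are treated as high prevalence. The boundary handling is an assumption.
PREVALENCE_THRESHOLD = 1000


def segment_by_prevalence(known_objects, threshold=PREVALENCE_THRESHOLD):
    """Assign each known object to a 'high' or 'low' prevalence segment.

    known_objects maps an object identifier (e.g. a file hash) to a
    prevalence value, here an illustrative observation count.
    """
    segments = {"high": set(), "low": set()}
    for object_id, prevalence in known_objects.items():
        key = "high" if prevalence >= threshold else "low"
        segments[key].add(object_id)
    return segments


# Example data (made up for illustration).
known = {"office_dll_hash": 250000, "rare_tool_hash": 12, "edge_case_hash": 1000}
segments = segment_by_prevalence(known)
```

Any number of segments could be used instead of two, for example by replacing the single threshold with a sorted list of cut-off values.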

In one embodiment, the method 300 may further include determining whether to perform a lookup operation on the database. As an option, a Bloom filter may be utilized to determine whether to perform the lookup operation on the database. In this case, the Bloom filter may be stored on a client system. The Bloom filter may also be associated with and/or represent a blacklist.

Furthermore, a server system may be configured to update the Bloom filter stored on the client system. In this case, the updating may include pushing hashes stored as Bloom filter bit vectors to the client system. As an option, Bloom filter updates may be sent along with additional software updates.

More illustrative information will now be set forth regarding various optional architectures and features with which the foregoing technique may or may not be implemented, per the desires of the user. It should be strongly noted that the following information is set forth for illustrative purposes and should not be construed as limiting in any manner. Any of the following features may be optionally incorporated with or without the exclusion of other features described.

FIG. 4 shows a method 400 for reducing a number of signature lookups required by a system, in accordance with one embodiment. As an option, the method 400 may be implemented in the context of the architecture and environment of FIGS. 1-3. Of course, however, the method 400 may be carried out in any desired environment. It should also be noted that the aforementioned definitions may apply during the present description.

As shown, a signature lookup request is received. See operation 402. The signature lookup request is then analyzed using a Bloom filter stored on a client device. See operation 404.

It is then determined whether to perform a lookup based on the analysis. See operation 406. If it is determined that a lookup is to be performed, the lookup is performed. See operation 408.
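The flow of operations 402-408 might look like the following Python sketch. It is a minimal illustration under stated assumptions rather than the implementation described here: the `BloomFilter` class, its size (8192 bits, 4 positions per item), the SHA-256 double-hashing scheme, and the `remote_lookup` callback standing in for the network lookup are all hypothetical.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter; sizes and hashing scheme are illustrative."""

    def __init__(self, num_bits=8192, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item: bytes):
        # Derive several bit positions from one digest (double hashing).
        digest = hashlib.sha256(item).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item: bytes):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: bytes):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


def handle_lookup_request(signature, local_filter, remote_lookup):
    """Operations 402-408: consult the local filter, look up only on a hit."""
    if signature not in local_filter:      # operations 404/406: definite miss
        return False                       # no lookup is performed
    return remote_lookup(signature)        # operation 408: confirm the hit


# Example: a one-entry blacklist and a stand-in for the network lookup.
bad = hashlib.md5(b"malware sample").digest()
good = hashlib.md5(b"ordinary document").digest()
blacklist = {bad}
local_filter = BloomFilter()
local_filter.add(bad)
remote_calls = []


def remote_lookup(signature):
    remote_calls.append(signature)
    return signature in blacklist


result_bad = handle_lookup_request(bad, local_filter, remote_lookup)
result_good = handle_lookup_request(good, local_filter, remote_lookup)
```

Because a Bloom filter never reports a stored item as absent, a miss can safely skip the lookup; only hits, which may be false positives, reach the network.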

Thus, it may be determined whether to perform a lookup operation on a database utilizing a Bloom filter. In this case, the Bloom filter may be stored on a client system. The client system may include a computer, a PDA, a mobile phone, or any other type of computing device.

Furthermore, a server system may update the Bloom filter stored on the client system. In this case, the updating may include pushing hashes stored as Bloom filter bit vectors to the client system. As an option, Bloom filter updates may be sent along with additional software updates. Furthermore, the Bloom filter may be associated with a blacklist.
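One way a client might apply such a pushed bit vector is sketched below in Python. The text does not say how updates are merged, so this is purely an assumption: the update is taken to be a bit vector of the same length, built with the same hash functions as the local filter, and it is merged with a bitwise OR. The function name `apply_bloom_update` is hypothetical.

```python
def apply_bloom_update(local_bits: bytearray, update_bits: bytes) -> None:
    """Merge a pushed bit vector into the local filter in place (bitwise OR).

    Assumes both vectors have the same length and were produced with the
    same hash functions; otherwise the merged filter would be meaningless.
    """
    if len(local_bits) != len(update_bits):
        raise ValueError("update does not match the local filter size")
    for index, byte in enumerate(update_bits):
        local_bits[index] |= byte


# Example with tiny, made-up 3-byte vectors.
local = bytearray([0x01, 0x00, 0x10])
update = bytes([0x02, 0x80, 0x10])
apply_bloom_update(local, update)
```

An OR merge can only add entries. Removing entries from a plain Bloom filter is not possible, so a removal would require the server to send a freshly built filter.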

Using the method 400, the number of network based lookups in a network may be reduced. For example, in many systems, the number of network based lookups in a cloud is very high. This is because the network based lookups involve signature lookups across a network for each file scanned on a system (e.g. a client system, etc.).

In some cases, however, it may be desirable to keep the lookup rate to less than a certain number of lookups per day. For example, it may be desirable to keep the lookup rate to less than ten lookups per day per client. To accomplish this, harsh criteria are often used to keep the lookup rates low. As a result, many problematic items (e.g. malware, etc.) are not examined and such items are missed on the client systems.

By performing a lookup of file signatures that are available locally on a client machine, better results may be achieved. In some cases, an antivirus DAT (AV DAT) based scanning model may be implemented. In these cases, the high number of specific signatures based on checksum or hash functions (e.g. Cyclic Redundancy Check, Message Digest Algorithm, etc.) in the DAT set may inflate the size of the DATs, making the DAT set computationally and economically infeasible. For example, large memory footprints of DATs may be a challenge on some low memory systems and on systems with slower connectivity to the Internet for downloading these files.

Additionally, the DAT releases may have to be very frequent to achieve the existing performance levels of real time lookups. Thus, there is a need for a relatively smaller DAT size that is sufficient to determine if the MD5 being looked up is present in a blacklist database.

Accordingly, in one embodiment, a set membership data structure may be utilized, such as a Bloom filter, that provides large savings in space, potentially at the expense of false positives. As an option, the MD5 hashes may be stored as a Bloom filter bit vector and pushed very frequently to the client system. Since, in some cases, an MD5 lookup may only determine if the MD5 for a given file is present in the bad file set, this lookup may be accomplished with the local Bloom filter.

Because a Bloom filter hit may be a false positive, the hit needs to be confirmed by a lookup; Bloom filters may therefore be used to determine if a lookup should occur. Where false positive rates are not an issue, Bloom filters that have a high compression, and therefore a very small size, may be utilized. Additionally, in one embodiment, Bloom filter updates may be streamed between DAT releases to ensure these are near to real time lookups.
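To give a feel for the space savings, the following Python sketch applies the standard Bloom filter sizing formulas, m = -n ln(p) / (ln 2)^2 bits and k = (m / n) ln 2 hash functions, where n is the number of stored items and p the target false positive rate. The figures used (one million MD5 hashes, a 1% false positive rate) are illustrative assumptions and do not come from the description.

```python
import math


def bloom_parameters(num_items: int, false_positive_rate: float):
    """Return (bits, hashes) for a Bloom filter using the standard formulas."""
    bits = math.ceil(-num_items * math.log(false_positive_rate)
                     / (math.log(2) ** 2))
    hashes = max(1, round(bits / num_items * math.log(2)))
    return bits, hashes


# Illustrative figures: one million 16-byte MD5 hashes, 1% false positives.
num_items = 1_000_000
bits, hashes = bloom_parameters(num_items, 0.01)
filter_bytes = math.ceil(bits / 8)   # roughly 1.2 MB
raw_bytes = num_items * 16           # 16 MB of raw MD5 digests
```

Under these assumptions the filter needs under ten bits per entry, more than an order of magnitude less than the raw digests. Accepting a higher false positive rate shrinks the filter further, at the cost of more confirming lookups.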

FIG. 5 shows a system 500 for reducing a number of signature lookups required by a system and for segmenting a database based, at least in part, on a prevalence associated with known objects included in the database, in accordance with one embodiment. As an option, the system 500 may be implemented in the context of the architecture and environment of FIGS. 1-4. Of course, however, the system 500 may be implemented in the context of any desired environment. Once again, the aforementioned definitions may apply during the present description.

As shown, the system 500 may include memory 502. The memory 502 may be allocated to a database dedicated to a whitelist database 504 and/or a blacklist database 506. The whitelist database 504 and the blacklist database 506 may be located on a client system 508. The client system 508 may be in communication with a server system 510 over a network 512.

As shown, the whitelist database 504 may be segmented into segments that include high prevalence objects and low prevalence objects. This may be implemented to store a large whitelist of programs on a host in a memory efficient way.

For example, in the context of network security, the threat landscape has grown exponentially and continues to grow. The threats have grown so much that almost all the Anti-Virus (AV) vendors are currently evaluating and implementing various technologies to mitigate the unmatched growth in number of threats. Behavioral detection, automated signature creation, heuristic detections, black listing packers, etc. are a few of the recent innovations that most AV vendors implement.

Many of these innovations use the “black listing” approach, where a threat is detected and mitigated if it is known to be of malicious nature. These threats may include a file, a network packet, a particular behavior, etc.

Black listing generally only promises to detect a threat if the threat was analyzed before by the AV provider and has been deemed malign. One issue with this approach is the increasing number of signatures a client side solution needs to carry every time an AV vendor analyzes and deems a file/network packet/behavior as malign. These are typically delivered to the client computers in the form of signature updates.

The exponential growth in the threat landscape has also resulted in an exponential growth in signatures carried by these AV solutions. As another approach, a white listing technique may be implemented to keep a system free from threats. These systems are generally based on the premise that anything not known could be malicious.

This technology is generally more pro-active in mitigating the threats as anything new entering a system is deemed suspicious. This technique again calls for the client systems to carry the most recent updates of a whitelist database. A whitelist database would carry signatures of all the files that are known to be benign. However, with time and advancements in technology, the number of “good” files is also expected to increase. Thus, white listing may also see exponentially increasing updates.

Also, new proactive techniques may have higher than usual false positive rates. To mitigate this, a technique may be implemented to store a large whitelist of programs on a host in a memory efficient way. Therefore, the ability to store a large whitelist on a host allows proactive techniques to be more aggressive against new and potentially unseen malware samples.

In the case of good files, or the files that are benign in nature, a whitelist database may be segmented into several parts, including high prevalence parts and low prevalence parts. In one embodiment, the prevalence information may be collected through the client side AV software whenever programs are executed on the system. A first segment may contain signatures for files that are most prevalent (e.g. information on all Microsoft Office files, Adobe files, all system .dlls loaded by these applications, etc.). A second segment may contain signatures for the files that are not prevalent.

By nature of the design of a computing system, the first segment may contain files released as part of operating systems and as part of widely used software applications. In general, this set of files would not change frequently. As an option, these files may be delivered to the client systems as a Bloom filter bit vector representing the MD5 values of all the white listed files, incrementally at longer intervals of time. In this way, the overhead of delivering large sized signature files may be reduced.

In one embodiment, Bloom filters may be used as a whitelist to offset a high false positive rate that an aggressive proactive test may introduce. For example, if a data mining technique is used that achieves a 90% true positive rate (TPR) at a 1% false positive rate (FPR), a Bloom filter may be used to make sure that the detected items do not contain known good applications. In this way, as a worst case, the heuristic would be ineffective if the Bloom filter has a false positive.
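A rough Python sketch of this use follows: items flagged by an aggressive heuristic are dropped if the whitelist Bloom filter reports them as possibly known good. Everything named here is hypothetical, including the filter size, the SHA-256 based position function, the integer bitmask representation, and the helper names.

```python
import hashlib

NUM_BITS = 1 << 16   # hypothetical filter size in bits
NUM_HASHES = 4       # hypothetical number of bit positions per item


def bit_positions(md5_digest: bytes):
    """Map an MD5 digest to NUM_HASHES bit positions (illustrative scheme)."""
    d = hashlib.sha256(md5_digest).digest()
    h1 = int.from_bytes(d[:8], "big")
    h2 = int.from_bytes(d[8:16], "big") | 1
    return [(h1 + i * h2) % NUM_BITS for i in range(NUM_HASHES)]


def build_filter(md5_digests):
    """Build the bit vector, held here as one Python integer bitmask."""
    bits = 0
    for digest in md5_digests:
        for pos in bit_positions(digest):
            bits |= 1 << pos
    return bits


def maybe_known_good(filter_bits, md5_digest):
    """True if the digest may be whitelisted (false positives possible)."""
    return all((filter_bits >> pos) & 1 for pos in bit_positions(md5_digest))


def filter_detections(flagged, whitelist_filter):
    """Keep only flagged digests that the whitelist filter does not match."""
    return [d for d in flagged if not maybe_known_good(whitelist_filter, d)]


# Example: the heuristic wrongly flags a known good file plus an unknown one.
known_good = hashlib.md5(b"popular office application").digest()
unknown = hashlib.md5(b"never seen before").digest()
whitelist_filter = build_filter([known_good])
remaining = filter_detections([known_good, unknown], whitelist_filter)
```

A known good file is always removed from the detections. The cost, as noted above, is that a Bloom filter false positive could also remove a genuinely malicious item, so the filter's false positive rate bounds how much detection is given up.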

Furthermore, with respect to the system 500, the blacklist 506 may be represented by a set membership data structure, such as a Bloom filter, that provides large savings in space. The MD5 hashes may be stored as a Bloom filter bit vector and pushed very frequently to the client system 508 by the server system 510 over the network 512. Since, in some cases, an MD5 lookup may only determine if the MD5 for a given file is present in the bad file set, this lookup may be accomplished with the local Bloom filter.

In this way, the Bloom filter may be used to determine if a lookup should occur. As an option, Bloom filters that have a high compression and a small size may be utilized. In one embodiment, Bloom filter updates may be streamed between DAT releases provided by the server system 510.

While various embodiments have been described above, it should be understood that they have been presented by way of example only, and not limitation. Thus, the breadth and scope of a preferred embodiment should not be limited by any of the above-described exemplary embodiments, but should be defined only in accordance with the following claims and their equivalents.

1-20. (canceled)
21. A method, comprising: accessing a database that is provided for a client system; and assigning a plurality of known objects to one of a plurality of segments based, at least in part, on prevalence information associated with each of the plurality of known objects, wherein the database includes whitelisted objects, which are non-malicious objects and which are defined utilizing a Bloom filter of the client system, and wherein the Bloom filter is further configured to determine whether to perform a lookup operation on the database.
22. The method of claim 21, wherein the prevalence information is obtained using antivirus software of the client system.
23. The method of claim 21, wherein the database includes a whitelist database.
24. The method of claim 21, wherein an antivirus DAT model is implemented in conjunction with the Bloom filter and the database.
25. The method of claim 21, wherein the Bloom filter is implemented in conjunction with MD5 hashes, which are communicated to the client system.
26. The method of claim 21, wherein Bloom filter updates are streamed to the client system between DAT releases.
27. The method of claim 21, wherein the Bloom filter is utilized as a whitelist to offset a false positive rate.
28. The method of claim 21, wherein a server updates the Bloom filter for the client system.
29. The method of claim 21, wherein updating the Bloom filter includes communicating hashes stored as Bloom filter bit vectors to the client system.
30. The method of claim 21, wherein Bloom filter updates are sent along as part of additional software updates for the client system.
31. The method of claim 21, wherein the Bloom filter is utilized as a blacklist.
32. The method of claim 21, further comprising: receiving a signature lookup request; and analyzing the signature lookup request using the Bloom filter.
33. The method of claim 21, wherein the client system is a selected one of a group of client devices, the group consisting of: a) a computer; b) a personal digital assistant (PDA); and c) a mobile phone.
34. The method of claim 21, wherein Bloom filter bit vectors, representing MD5 values of whitelist files, are provided to the client system.
35. A method, comprising: receiving a signature lookup request; analyzing the signature lookup request using a Bloom filter provided for a client system; and performing a lookup in a database based on the signature lookup request, wherein the database comprises a blacklist database and a whitelist database, which is segmented into high prevalence objects and low prevalence objects, wherein the high prevalence objects reflect objects that are accessed more frequently than objects included in the low prevalence objects.
36. A client system, comprising: a processor; and a memory coupled to the processor, wherein the client system is configured to: access a database that is provided for the client system; and assign a plurality of known objects to one of a plurality of segments based, at least in part, on prevalence information associated with each of the plurality of known objects, wherein the database includes whitelisted objects, which are non-malicious objects and which are defined utilizing a Bloom filter of the client system, and wherein the Bloom filter is further configured to determine whether to perform a lookup operation on the database.
37. The client system of claim 36, wherein an antivirus DAT model is implemented in conjunction with the Bloom filter and the database.
38. The client system of claim 36, wherein the Bloom filter is implemented in conjunction with MD5 hashes, which are communicated to the client system, and wherein Bloom filter updates are streamed to the client system between DAT releases.
39. The client system of claim 36, wherein the client system is further configured to: receive a signature lookup request; and analyze the signature lookup request using the Bloom filter.
40. The client system of claim 36, wherein Bloom filter bit vectors, representing MD5 values of whitelist files, are provided to the client system.